ABSTRACT

This  paper  explores  a  two-dimensional  (2-D)  processing 
approach  for  co-channel  speaker  separation  of  voiced  speech. 
We  analyze  localized  time-frequency  regions  of  a  narrowband 
spectrogram  using  2-D  Fourier  transforms  and  propose  a  2-D 
amplitude  modulation  model  based  on  pitch  information  for 
single  and  multi-speaker  content  in  each  region.  Our  model 
maps  harmonically-related  speech  content  to  concentrated 
entities  in  a  transformed  2-D  space,  thereby  motivating  2-D 
demodulation  of  the  spectrogram  for  analysis/synthesis  and 
speaker  separation.  Using  a  priori  pitch  estimates  of  individual 
speakers,  we  show  through  a  quantitative  evaluation:  1)  Utility 
of  the  model  for  representing  speech  content  of  a  single  speaker 
and  2)  Its  feasibility  for  speaker  separation.  For  the  separation 
task,  we  also  illustrate  benefits  of  the  model's  representation  of 
pitch  dynamics  relative  to  a  sinusoidal-based  separation  system. 
Index  Terms —  Grating  Compression  Transform,  speaker 
separation,  spectrogram  demodulation,  2-D  speech  analysis 

1.  INTRODUCTION 

Co-channel  speaker  separation  is  a  challenging  task  in  audio 
processing.  For  all-voiced  speech,  current  methods  operate  on 
short-time  frames  of  mixture  signals  (e.g.,  harmonic 
suppression, sinusoidal analysis, modulation spectrum [1-3])
or  on  single  units  of  a  time-frequency  distribution  (e.g.,  binary 
masking  [4]).  Alternatively,  this  paper  proposes  and  assesses 
the  feasibility  of  a  2-D  analysis  framework  for  this  task.  We 
analyze  localized  time-frequency  regions  of  a  narrowband 
spectrogram  using  2-D  Fourier  transforms,  a  representation  we 
refer  to  as  the  Grating  Compression  Transform  (GCT). 
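
As a rough illustration of what the GCT computes, the following sketch takes the 2-D Fourier transform of a windowed synthetic patch with evenly spaced harmonic lines and locates the resulting carrier peak. The function name `gct`, the patch size, the FFT length, and the synthetic patch itself are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def gct(patch, nfft=256):
    """2-D Fourier transform of a windowed time-frequency patch.

    Axis 0 is time (n), axis 1 is frequency (m).
    """
    n_len, m_len = patch.shape
    window = np.outer(np.hamming(n_len), np.hamming(m_len))
    return np.fft.fftshift(np.fft.fft2(patch * window, s=(nfft, nfft)))

# Synthetic patch: flat envelope with harmonic lines every 8 frequency bins.
n = np.arange(100)[:, None]
m = np.arange(40)[None, :]
omega_s = 2 * np.pi / 8
patch = 1.0 + np.cos(omega_s * m + 0 * n)

G = np.abs(gct(patch))
axis = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(256))  # radians per sample
W, O = np.meshgrid(axis, axis, indexing="ij")

# Suppress the near-DC pedestal term, then locate the strongest carrier.
G_carrier = np.where(np.hypot(W, O) > 0.1 * np.pi, G, 0.0)
i, j = np.unravel_index(np.argmax(G_carrier), G.shape)
peak_omega, peak_Omega = axis[i], axis[j]
```

With stationary harmonic lines the carrier energy sits on the axis associated with frequency (`peak_omega` close to zero), at a distance set by the line spacing.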

The GCT has been explored by Quatieri [5], Ezzat et al. [6, 7],
and Wang and Quatieri [8] primarily for single-speaker analysis
and is consistent with physiological modeling studies
implicating 2-D analysis of sounds by auditory cortex neurons
[9]. Ezzat et al. performed analysis/synthesis of a single
speaker using 2-D demodulation of the spectrogram [7]. In [8],
we  proposed  an  alternative  2-D  modulation  model  for  formant 
analysis.  Phenomenological  observations  in  [5,  6]  have  also 
suggested  that  the  GCT  invokes  separability  of  multiple 
speakers.  Finally,  in  recent  work,  we  have  demonstrated  the 
GCT's  ability  in  analysis  of  multi-pitch  signals  [10].  This  paper 
builds  on  these  previous  efforts  in  several  ways. 
First, in Section 2.1, we investigate GCT analysis of a single
speaker  using  a  2-D  amplitude  modulation  (AM)  model  based 
on  pitch  information.  Section  2.2  extends  this  model  to 
analysis  of  multiple  speakers  to  account  for  the  observations 
made  in  [5,6]  regarding  speaker  separability  in  the  GCT.  Our 
framework  motivates  2-D  sinusoidal  demodulation  of  the 
spectrogram for: 1) single-speaker analysis/synthesis and 2)
speaker separation. Section 3 describes algorithms for these
tasks. Section 4 presents a quantitative evaluation of these
methods on real speech to assess: 1) Utility of the AM model in
representing speech content of a single speaker and 2) Its
feasibility for the separation task using a priori pitch estimates
of  individual  speakers.  As  a  baseline,  we  compare  against  a 
sinusoidal-based  separation  system  that  similarly  uses  such 
pitch  estimates  [2].  Section  5  concludes  with  future  directions. 

2.  2-D  PROCESSING  FRAMEWORK 

2.1.  Single-speaker  Model 

Consider a localized time-frequency region $s[n,m]$ (discrete
time and frequency $n$, $m$) of a narrowband short-time Fourier
transform magnitude (STFTM) (Figure 1) computed for a
single voiced utterance. Here, we extend a 2-D amplitude
modulation (AM) model from our previous work [8] such that

$$s[n,m] \approx (\alpha_0 + \cos(\Phi[n,m]))\,a[n,m], \qquad \Phi[n,m] = \omega_s(n\cos\theta + m\sin\theta) + \varphi, \tag{1}$$

i.e., a sinusoid with spatial frequency $\omega_s$, orientation $\theta$, and
phase $\varphi$ rests on a DC pedestal $\alpha_0$ and modulates a slowly-varying
envelope $a[n,m]$. The 2-D Fourier transform of
$s[n,m]$ (i.e., the GCT) is

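$$S(\omega,\Omega) = \alpha_0 A(\omega,\Omega) + 0.5\,e^{-j\varphi} A(\omega + \omega_s\sin\theta,\ \Omega - \omega_s\cos\theta) + 0.5\,e^{j\varphi} A(\omega - \omega_s\sin\theta,\ \Omega + \omega_s\cos\theta) \tag{2}$$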

where $\omega$ and $\Omega$ map to $n$ and $m$, respectively. The sinusoid
represents the harmonic structure associated with the speaker's
pitch [5, 10]. Denoting $f_s$ as the waveform sampling frequency
and $N_{STFT}$ as the discrete-Fourier transform (DFT) length of
the STFT, the GCT parameters relate to the speaker's pitch ($f_0$)
at the center (in time) of $s[n,m]$ (Figure 1b, c) [5, 10]:

$$f_0 = \frac{2\pi f_s}{N_{STFT}\,\omega_s\cos\theta}. \tag{3}$$
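
To make the mapping in (3) concrete, the sketch below evaluates it numerically using the 8-kHz sampling rate and 512-point DFT reported later in the paper. The function names are hypothetical helpers introduced only for this example.

```python
import math

def pitch_from_carrier(omega_s, theta, fs, n_stft):
    """Equation (3): pitch in Hz from carrier frequency (rad) and orientation (rad)."""
    return (2 * math.pi * fs) / (n_stft * omega_s * math.cos(theta))

def carrier_from_pitch(f0, theta, fs, n_stft):
    """Equation (3) solved for the carrier spatial frequency omega_s."""
    return (2 * math.pi * fs) / (n_stft * f0 * math.cos(theta))

# A carrier at 0.1*pi with no pitch change (theta = 0) maps to 312.5 Hz.
f0_limit = pitch_from_carrier(0.1 * math.pi, 0.0, fs=8000, n_stft=512)

# Round trip for a 120-Hz pitch with a small orientation angle.
omega_120 = carrier_from_pitch(120.0, 0.1, fs=8000, n_stft=512)
f0_back = pitch_from_carrier(omega_120, 0.1, fs=8000, n_stft=512)
```

The first value is in line with the roughly 300-Hz upper pitch limit associated with the $0.1\pi$ cut-off in Section 3.1.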


1  This  work  was  supported  by  the  Department  of  Defense  under  Air  Force  contract  FA8721-05-C-0002.  The  opinions,  interpretations, 
conclusions,  and  recommendations  are  those  of  the  authors  and  are  not  necessarily  endorsed  by  the  United  States  government. 

A change in $f_0$ ($\Delta f_0$) across $\Delta n$ results in an absolute change
in frequency of the $k$th pitch harmonic by $k\Delta f_0$. Therefore, in a
localized time-frequency region (Figure 1b)

$$\tan\theta \approx \frac{k\,\Delta f_0}{\Delta n}. \tag{4}$$


Figure 1. (a) Schematic of full STFTM with localized time-frequency region
centered at $t_{center}$ and $f_{center}$ for GCT analysis (rectangle); (b) Localized region
of (a) with harmonic structure (parallel lines) and envelope (shaded); triangle
indicates spacing between harmonic lines; note also relation between $\theta$, $k\Delta f_0$,
and $\Delta n$; (c) GCT of (a) with baseband (dashed) and modulated (shaded) versions
of the envelope; (d) Demodulation to recover near-DC terms.

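For a particular $s[n,m]$ with center frequency $f_{center}$ (Figure
1a), $f_0$ can be obtained from (3) such that $k \approx f_{center}/f_0$. The
rate of change of $f_0$ ($\partial f_0/\partial t$) in $s[n,m]$ is then

$$\frac{\partial f_0}{\partial t} \approx \frac{\Delta f_0}{\Delta n} \approx \frac{f_0\tan\theta}{f_{center}}. \tag{5}$$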
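
The step from (4) to (5) can be spelled out as follows, using only the relations already stated (the harmonic slope in (4) and the harmonic number $k \approx f_{center}/f_0$). Identifying $\Delta f_0/\Delta n$ with $\partial f_0/\partial t$ assumes that the region is short enough for the pitch to change roughly linearly within it.

```latex
\tan\theta \approx \frac{k\,\Delta f_0}{\Delta n}
\;\Longrightarrow\;
\frac{\Delta f_0}{\Delta n} \approx \frac{\tan\theta}{k}
\approx \frac{f_0\,\tan\theta}{f_{center}},
\qquad k \approx \frac{f_{center}}{f_0}.
```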

Finally, $\varphi$ corresponds to the position of the sinusoid in
$s[n,m]$; for a non-negative DC value of $a[n,m]$, $\varphi$ can be
obtained by analyzing the GCT at ($\omega = \omega_s\sin\theta$, $\Omega = \omega_s\cos\theta$):

$$\varphi = \mathrm{angle}[S(\omega_s\sin\theta,\ \omega_s\cos\theta)]. \tag{6}$$

Our model maps harmonically related speech content in each
$s[n,m]$ to concentrated entities in the GCT near DC and at 2-D
"carriers" (Figure 1c). Observe that if the near-DC terms were
removed or corrupted, our model motivates approximate
recovery of the near-DC terms from the carrier terms using
sinusoidal demodulation (Figure 1d). Using demodulation, the
full  STFTM  can  then  be  recovered  and  combined  with  the 
STFT  phase  for  approximate  waveform  reconstruction. 
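
The recovery idea can be sketched on a synthetic patch that follows the AM model exactly. In this example the patch is periodic so that ideal circular filters can be applied by masking the 2-D DFT; the patch size, carrier location, envelope, the factor of 2 applied to the carrier, and the helper name `circular_filter` are all assumptions of the sketch rather than details from the paper. The final least-squares step for the pedestal anticipates the fitting procedure of Section 3.1.

```python
import numpy as np

N = 128                                   # periodic patch size on both axes
n = np.arange(N)[:, None]                 # time index
m = np.arange(N)[None, :]                 # frequency index
w = 2 * np.pi * np.fft.fftfreq(N)         # radian frequency of each DFT bin
W, O = np.meshgrid(w, w, indexing="ij")
radius = np.hypot(W, O)

def circular_filter(x, keep):
    """Zero-phase filtering by masking the 2-D DFT with a boolean mask."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * keep))

# Assumed test signal: pedestal plus carrier riding on a smooth envelope.
alpha0 = 0.8
a = 1.0 + 0.5 * np.cos(2 * np.pi * n / N) * np.cos(2 * np.pi * m / N)
phase = 2 * np.pi * (4 * n + 16 * m) / N + 0.3   # carrier phase Phi[n, m]
s = (alpha0 + np.cos(phase)) * a

# Remove the near-DC terms, then demodulate with the known carrier.
s_hp = circular_filter(s, radius > 0.1 * np.pi)
a_hat = circular_filter(2 * s_hp * np.cos(phase), radius <= 0.1 * np.pi)

# Least-squares pedestal: s = alpha0 * a_hat + cos(Phi) * a_hat is linear in alpha0.
resid = s - np.cos(phase) * a_hat
alpha0_hat = np.sum(a_hat * resid) / np.sum(a_hat ** 2)
```

Because the synthetic envelope is band-limited well inside the $0.1\pi$ cut-off, the envelope and pedestal are recovered essentially exactly here; real spectrogram patches would only be recovered approximately.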

2.2.  Multi-speaker  Extension 

In [5, 6], the GCT space was suggested to separate multiple
speakers. To account for these observations, we approximate
the STFTM computed for a mixture of $N$ speakers in a localized
time-frequency region $x[n,m]$ as the sum of their individual
magnitudes. Using the model of (1), we then have

$$x[n,m] \approx \sum_{i=1}^{N} \alpha_{0,i}\,a_i[n,m] + \sum_{i=1}^{N} a_i[n,m]\cos(\omega_i[n\cos\theta_i + m\sin\theta_i] + \varphi_i). \tag{7}$$


Equation (7) invokes the sparsity of harmonic line structure
from distinct speakers in the STFTM (i.e., when harmonic
components of the speakers are located at different frequencies).
Nonetheless, separation of speaker content in the GCT can still
be maintained when speakers exhibit harmonics located at
identical frequencies (e.g., when pitch values are equal or are
integer multiples of each other) due to its
representation of pitch dynamics through $\theta_i$ in (7) [10]. An
example of this is shown schematically in Figure 2a-b, where
two speakers have equal pitch values but distinct pitch
dynamics, thereby allowing separability in the GCT.


Figure 2. (a) Localized time-frequency region of STFTM computed on a mixture
of two speakers; speaker with rising pitch and falling formant structure (solid, red);
speaker with stationary pitch and stationary formant (dashed, blue); yellow arrow
denotes same pitch value at center of region; (b) GCT of (a) showing overlap of
near-DC terms (green rectangle); speakers exhibit the same vertical distances
(black arrow) from the $\omega$-axis corresponding to equal pitch values; separability is
maintained due to distinct angular positions off of the $\Omega$-axis; (c) Demodulation
to recover near-DC terms of one speaker.

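The 2-D Fourier transform of (7) is

$$X(\omega,\Omega) = \sum_{i=1}^{N} \alpha_{0,i} A_i(\omega,\Omega) + 0.5\sum_{i=1}^{N} A_i(\omega + \omega_i\sin\theta_i,\ \Omega - \omega_i\cos\theta_i)\,e^{-j\varphi_i} + 0.5\sum_{i=1}^{N} A_i(\omega - \omega_i\sin\theta_i,\ \Omega + \omega_i\cos\theta_i)\,e^{j\varphi_i}. \tag{8}$$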

For slowly-varying $A_i(\omega,\Omega)$, the contribution to $X(\omega,\Omega)$
from multiple speakers exhibits overlap near the GCT origin
(Figure 2b); however, as in the single-speaker case, $A_i(\omega,\Omega)$
can be estimated through sinusoidal demodulation according to
the  proposed  model.  This  model  therefore  motivates  localized 
2-D  demodulation  of  the  STFTM  computed  for  a  mixture  of 
speakers  for  the  speaker  separation  task  (Figure  2c). 
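
A minimal sketch of this idea on synthetic data is given below. Two assumed "speakers" share the same carrier component along the frequency axis (equal pitch) but differ along the time axis (different pitch dynamics), and each envelope is recovered by demodulating the high-passed mixture with that speaker's carrier. The periodic patch, the carrier locations, the envelopes, the factor of 2 on the carrier, and the helper name `gct_mask` are assumptions of this example.

```python
import numpy as np

N = 128
n = np.arange(N)[:, None]
m = np.arange(N)[None, :]
w = 2 * np.pi * np.fft.fftfreq(N)
W, O = np.meshgrid(w, w, indexing="ij")
lowpass = np.hypot(W, O) <= 0.1 * np.pi

def gct_mask(x, keep):
    """Apply an ideal filter by masking the 2-D DFT of the patch."""
    return np.real(np.fft.ifft2(np.fft.fft2(x) * keep))

# Same carrier component along m (equal pitch), different component along n.
phases = [2 * np.pi * (0 * n + 16 * m) / N + 0.2,
          2 * np.pi * (20 * n + 16 * m) / N + 1.1]
envelopes = [1.0 + 0.5 * np.cos(2 * np.pi * n / N) + 0 * m,
             0.6 + 0.3 * np.cos(2 * np.pi * m / N) + 0 * n]
pedestals = [0.9, 0.7]

# Mixture modelled as the sum of the individual AM patches.
x = sum((p + np.cos(ph)) * a for p, ph, a in zip(pedestals, phases, envelopes))
x_hp = gct_mask(x, ~lowpass)              # remove overlapping near-DC terms

# Demodulate with each speaker's carrier and keep the baseband result.
recovered = [gct_mask(2 * x_hp * np.cos(ph), lowpass) for ph in phases]
errors = [float(np.max(np.abs(r - a))) for r, a in zip(recovered, envelopes)]
```

In this idealized setting the cross terms fall outside the low-pass region, so both envelopes are recovered despite the equal pitch values.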

3.  ALGORITHMS 

Herein  we  discuss  algorithms  motivated  by  the  models  of 
Section  2.  Section  3.1  discusses  2-D  demodulation  of  the 
STFTM  for  analysis/synthesis  of  a  single  speaker.  Our 
approach is distinct from work by Ezzat et al. in which
scattered data interpolation was used for demodulation [7]. In
this work, we apply sinusoidal demodulation in conjunction
with a least-squared error fit to estimate the gain parameter in
(1). Section 3.2 describes a similar algorithm for the speaker
separation task. Both methods assume a priori pitch estimates
of  individual  speakers. 

3.1. Single-speaker Analysis/Synthesis

To assess the AM model's ability to represent speech content of
a single speaker, an STFT is computed for the signal using a
20-ms Hamming window, 1-ms frame interval, and 512-point
DFT. From the full STFTM ($s_F[n,m]$), localized regions
centered at $k$ and $l$ in time and frequency ($s_{kl}[n,m]$) of size 625
Hz by 100 ms are extracted using a 2-D Hamming window
($w_h[n,m]$) for GCT analysis. We then apply a high-pass filter
$h_{hp}[n,m]$ to each $s_{kl}[n,m]$ to remove $\alpha_0 A(\omega,\Omega)$ in (2); we
denote this result as $s_{kl,hp}[n,m]$. $h_{hp}[n,m]$ is a circular filter
with cut-offs at $\omega = \Omega = 0.1\pi$, corresponding in $\omega$ to a ~300
Hz upper limit of $f_0$ values observed in analysis.


For each $s_{kl,hp}[n,m]$, we aim to approximately recover
$\alpha_0 A(\omega,\Omega)$ using 2-D sinusoidal demodulation. The
carrier ($\cos(\Phi[n,m])$) parameters are determined from the
speaker's pitch track using (3) for $\omega_s$ and (6) for $\varphi$. To
determine $\theta$, a linear least-squared error fit is applied to the
pitch values spanning the 100-ms duration of $s_{kl,hp}[n,m]$. The
slope of this fit approximates $\partial f_0/\partial t$ such that $\theta$ is estimated
using (5). $s_{kl,hp}[n,m]$ is multiplied by the carrier generated
from these parameters followed by filtering with a circular low-pass
filter $h_{lp}[n,m]$ with cut-offs at $\omega = \Omega = 0.1\pi$; we denote
this result as $\hat{a}[n,m]$. $\hat{a}[n,m]$ is combined with the carrier
using (1) and set equal to $s_{kl}[n,m]$:

$$s_{kl}[n,m] = (\alpha_0 + \cos(\Phi[n,m]))\,\hat{a}[n,m]. \tag{9}$$
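
The orientation estimate described above (a linear fit to the pitch track, followed by (5)) might be sketched as follows. The paper states (4) and (5) without units, so the conversion of the pitch slope into DFT bins per frame is an assumption of this sketch, as are the function name `orientation_from_pitch` and the example pitch track.

```python
import numpy as np

def orientation_from_pitch(pitch_hz, f_center, frame_s=0.001, fs=8000, n_stft=512):
    """Estimate theta for one region from its pitch values (one per frame)."""
    frames = np.arange(len(pitch_hz))
    slope, intercept = np.polyfit(frames, pitch_hz, 1)   # Hz per frame
    f0_mid = slope * frames.mean() + intercept            # fitted pitch at region centre
    k = f_center / f0_mid                                  # harmonic number near f_center
    # Assumed unit conversion: slope of the k-th harmonic in DFT bins per frame.
    bins_per_frame = k * slope * n_stft / fs
    return np.arctan(bins_per_frame), slope / frame_s

# Pitch rising linearly from 100 Hz at 200 Hz/s over a 100-ms region.
t = np.arange(101) * 0.001
theta, df0_dt = orientation_from_pitch(100.0 + 200.0 * t, f_center=1100.0)
```

Here the fitted pitch at the region center is 110 Hz, so the region centered at 1100 Hz sits near the tenth harmonic.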


For each time-frequency unit of $s_{kl}[n,m]$, (9) corresponds to a
linear equation in $\alpha_0$ since the values of $s_{kl}[n,m]$, $\hat{a}[n,m]$, and
$\cos(\Phi[n,m])$ are known. This overdetermined set of equations
is solved in the least-squared error (LSE) sense. The resulting
estimate of $s_{kl}[n,m]$ using the estimated $\alpha_0$, $\hat{a}[n,m]$, and
$\cos(\Phi[n,m])$ is denoted as $\hat{s}_{kl}[n,m]$. The full STFTM
estimate $\hat{s}_F[n,m]$ is obtained using overlap-add (OLA) with a
LSE criterion (OLA-LSE) [11]

$$\hat{s}_F[n,m] = \frac{\sum_k\sum_l w_h[kT-n,\ lF-m]\,\hat{s}_{kl}[n,m]}{\sum_k\sum_l w_h^2[kT-n,\ lF-m]}. \tag{10}$$


OLA step sizes in time and frequency ($T$ and $F$) are set to 1/4
the size of $w_h[n,m]$. $\hat{s}_F[n,m]$ is then combined with the STFT
phase for waveform reconstruction using OLA-LSE [11].
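
The weighted overlap-add of (10) can be sketched in a few lines. In this example each patch is stored as a windowed region addressed by its top-left corner, which is a simplification of the indexing in (10); the function name `ola_lse`, the array sizes, and the random stand-in for the STFTM are assumptions of the sketch.

```python
import numpy as np

def ola_lse(patches, corners, window, shape, eps=1e-12):
    """Combine windowed patch estimates: sum(w * patch) / sum(w ** 2)."""
    num = np.zeros(shape)
    den = np.zeros(shape)
    hn, hm = window.shape
    for patch, (r, c) in zip(patches, corners):
        num[r:r + hn, c:c + hm] += window * patch
        den[r:r + hn, c:c + hm] += window ** 2
    return num / np.maximum(den, eps)

rng = np.random.default_rng(0)
full = rng.random((64, 48))                       # stand-in for a full STFTM
window = np.outer(np.hamming(16), np.hamming(12))
step_n, step_m = 4, 3                             # 1/4 of the window size
corners = [(r, c) for r in range(0, 64 - 16 + 1, step_n)
           for c in range(0, 48 - 12 + 1, step_m)]

# Unmodified windowed patches should recombine to the original array.
patches = [window * full[r:r + 16, c:c + 12] for r, c in corners]
rebuilt = ola_lse(patches, corners, window, full.shape)
```

When the patches are left unmodified, the ratio in (10) reduces to the original array, which is a convenient sanity check before substituting model-based patch estimates.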


3.2.  Speaker  Separation 

For  speaker  separation,  the  demodulation  steps  are  nearly 
identical  to  those  in  Section  3.1  but  applied  to  the  mixture 
signal. Briefly, let $x_{kl}[n,m]$ be a localized region of the full
STFTM computed for the mixture signal centered at $k$ and $l$ in
time and frequency. $x_{kl}[n,m]$ is filtered with $h_{hp}[n,m]$ to
remove the overlapping $\alpha_{0,i}A_i(\omega,\Omega)$ terms at the GCT origin
(Figure 2b); we denote this result as $x_{kl,hp}[n,m]$. A cosine
carrier for each speaker is generated using the corresponding
pitch track and multiplied by $x_{kl,hp}[n,m]$ to obtain

$$x_{kl,i}[n,m] = x_{kl,hp}[n,m]\cos(\omega_i[n\sin\theta_i + m\cos\theta_i] + \varphi_i) = a_i[n,m] + c[n,m]. \tag{11}$$

If the speakers' carriers are in distinct locations of the GCT,
$c[n,m]$ summarizes cross terms away from the GCT origin
such that $\hat{a}_i[n,m]$ can be obtained by filtering $x_{kl,i}[n,m]$ with
$h_{lp}[n,m]$. For each speaker, $\hat{a}_i[n,m]$ is combined with its
respective carrier using (1). These results are summed and set
equal to $x_{kl}[n,m]$ to solve for $\alpha_{0,i}$ in the LSE sense:

$$x_{kl}[n,m] = \sum_{i=1}^{N} (\alpha_{0,i} + \cos(\Phi_i[n,m]))\,\hat{a}_i[n,m]. \tag{12}$$

Recall  that  the  GCT  represents  pitch  and  pitch  dynamics;  it 
may  therefore  invoke  improved  speaker  separability  over 
representations  relying  solely  on  harmonic  sparsity  (Section 
2.2). In a region where speakers have equal pitch values and
the same temporal dynamics, however, (12) invokes a near-singular
matrix. To address this, we compute the angle
between the $\hat{a}_i[n,m]$ columns of the matrix. When this angle
is below a threshold of $\pi/10$, $\alpha_{0,i}$ is solved for by reducing
the matrix rank to that corresponding to a single speaker.
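
One way to read the fit of (12) together with the angle test is sketched below for two speakers. The paper does not spell out how the rank reduction is carried out, so the rank-1 truncated-SVD solution used here is an assumption, as are the function name `solve_pedestals` and the synthetic envelopes and carriers.

```python
import numpy as np

def solve_pedestals(x_kl, a_hats, carriers, angle_threshold=np.pi / 10):
    """LSE fit of (12) for the pedestals of two speakers.

    x_kl     : mixture patch
    a_hats   : demodulated envelopes, one array per speaker
    carriers : cos(Phi_i[n, m]) arrays, one per speaker
    """
    A = np.stack([a.ravel() for a in a_hats], axis=1)     # one column per speaker
    b = x_kl.ravel() - sum((c * a).ravel() for c, a in zip(carriers, a_hats))
    cos_angle = A[:, 0] @ A[:, 1] / (np.linalg.norm(A[:, 0]) * np.linalg.norm(A[:, 1]))
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))
    if angle < angle_threshold:
        # Assumed rank reduction: keep only the largest singular value.
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        alphas = Vt[0] * (U[:, 0] @ b) / S[0]
    else:
        alphas, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alphas, angle

n = np.arange(32)[:, None]
m = np.arange(32)[None, :]
a1 = 1.0 + 0.8 * np.cos(2 * np.pi * n / 32) + 0 * m
a2 = 1.0 + 0.8 * np.cos(2 * np.pi * m / 32) + 0 * n
c1 = np.cos(2 * np.pi * 8 * m / 32 + 0 * n)
c2 = np.cos(2 * np.pi * (5 * n + 8 * m) / 32)

# Distinct envelopes: ordinary least squares recovers both pedestals.
x = (0.9 + c1) * a1 + (0.7 + c2) * a2
alphas, angle = solve_pedestals(x, [a1, a2], [c1, c2])

# Identical envelopes give collinear columns and trigger the rank-1 branch.
x_same = (0.9 + c1) * a1 + (0.7 + c2) * a1
alphas_same, angle_same = solve_pedestals(x_same, [a1, a1], [c1, c2])
```

In the collinear case only the sum of the pedestals is identifiable, and the rank-1 solution splits it evenly between the two speakers.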

Finally,  the  estimated  full  STFTMs  of  the  target  speakers  are 
reconstructed  using  (10).  Speaker  waveforms  are  then 
reconstructed  using  OLA-LSE  by  combining  the  estimated 
STFTMs  with  the  STFT  phase  of  the  mixture  signal. 

4. PRELIMINARY EVALUATION

This  section  describes  preliminary  evaluations  of  the  algorithms 
of Sections 3.1 (denoted as Exp1) and 3.2 (Exp2). We
analyzed  two  all-voiced  sentences  sampled  at  8  kHz  ("Why 
were  you  away  a  year,  Roy?"  and  "Nanny  may  know  my 
meaning")  spoken  by  10  males  and  females  (40  total  sentences). 
Pitch  estimates  of  the  individual  sentences  were  determined 
prior  to  analysis  from  an  autocorrelation-based  pitch  tracker. 

In Exp1, we perform analysis/synthesis of a single speaker as
described in Section 3.1. For comparison, we also generated a
waveform by filtering $s_F[n,m]$ with an adaptive filter

$$h_s[n,m] = h_{lp}[n,m]\,(1 + 2\cos(\omega_s[n\sin\theta + m\cos\theta] + \varphi_s)) \tag{13}$$

where $\omega_s$, $\theta$, and $\varphi_s$ are determined for each localized time-frequency
region using the speaker's pitch track and $h_{lp}[n,m]$ is
that described in Section 3.1. The filtered STFTM is used to
recover the waveform as in Section 3.1. This method assesses
the  value  of  the  model  for  representing  speech  content  of  a 
single  speaker,  independent  of  the  2-D  LSE  fitting  procedure. 
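
Since multiplying a low-pass impulse response by $1 + 2\cos(\cdot)$ places copies of its passband at DC and at the two carrier locations, the filter of (13) can be illustrated as a three-disc mask in the GCT domain. The sketch below assumes an ideal circular low-pass response, a periodic synthetic patch, and a made-up carrier location; the helper name `disc` is hypothetical.

```python
import numpy as np

N = 128
n = np.arange(N)[:, None]
m = np.arange(N)[None, :]
w = 2 * np.pi * np.fft.fftfreq(N)
W, O = np.meshgrid(w, w, indexing="ij")

def disc(center_w, center_o, radius=0.1 * np.pi):
    """Boolean GCT-domain mask: circular passband around one centre (with wrap-around)."""
    dw = np.angle(np.exp(1j * (W - center_w)))
    do = np.angle(np.exp(1j * (O - center_o)))
    return np.hypot(dw, do) <= radius

cw, co = 2 * np.pi * 4 / N, 2 * np.pi * 16 / N    # assumed carrier location
passband = disc(0, 0) | disc(cw, co) | disc(-cw, -co)

# Content that follows the AM model, plus clutter away from the carrier.
a = 1.0 + 0.5 * np.cos(2 * np.pi * n / N) * np.cos(2 * np.pi * m / N)
model = (0.8 + np.cos(cw * n + co * m + 0.3)) * a
clutter = 0.4 * np.cos(2 * np.pi * (30 * n + 5 * m) / N)
filtered = np.real(np.fft.ifft2(np.fft.fft2(model + clutter) * passband))
```

Content matching the model passes through unchanged while the off-carrier clutter is rejected, which is the sense in which this filter isolates the model from the LSE fitting step.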

To  assess  the  feasibility  of  GCT-based  speaker  separation 
(Exp2),  we  analyzed  mixtures  of  two  sentences  (Nanny  +  Roy) 
spoken  by  10  males  and  females  mixed  at  0  dB  (90  mixtures 
total).  For  comparison,  we  used  a  baseline  sinewave-based 
separation  system  (SBSS);  SBSS  models  sinewave  amplitudes 
and phases given their frequencies (e.g., harmonics) for each
speech signal [2]. We chose this baseline for comparison as it
similarly uses a priori pitch estimates to obtain the sinusoidal
frequencies, and to assess potential benefits of the GCT's
explicit representation of pitch dynamics (Section 3.2).

2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, October 18-21, 2009, New Paltz, NY

Figure 3. (a) STFT magnitude of single speaker sentence (Roy); (b) Recovered
STFT magnitude using control method; (c) As in (b) but using demodulation; (d) A
priori pitch estimates of sentence in (a)-(c).

Figure 4. (a) STFT magnitude of mixture (Nanny + Roy); (b) Recovered STFT
magnitude of Roy sentence using SBSS with resulting SNR listed; (c) As in (b) but
using demodulation; (d) A priori pitch estimates of target (blue) and interfering
(red) speakers; pitch tracks exhibit crossings throughout mixture.


Table I. Average SNRs across 40 single sentences (Exp1) and 90 mixtures (Exp2).

            Exp1        Exp1      Exp2    Exp2      Exp2
            Filtering   Demod.    SBSS    Demod.    TruePhase
SNR (dB)    11.24       12.51     3.62    4.09      5.96

Figure  3  shows  STFTMs  obtained  in  the  single-speaker 
experiment  and  a  priori  pitch  estimates.  In  this  example, 
demodulation appears to provide a reconstruction similar to that of the
control  method.  In  Figure  4,  we  show  the  resulting  STFTMs 
for  the  separation  task  using  the  single-speaker  sentence  as  the 
target.  In  this  example,  the  pitch  tracks  of  the  target  and 
interferer  exhibit  crossings  (Figure  4d),  thereby  leading  to 
overlapping  harmonic  structure  in  the  mixture  STFTM. 
Qualitatively,  GCT  demodulation  appears  to  provide  a  more 
faithful  reconstruction  of  the  target  than  SBSS.  To  quantify  the 
performance in Exp1 and Exp2, we computed average signal-to-noise
ratios (SNR) of the original and reconstructed
waveforms (Table I). In Exp1, demodulation provides a better
reconstruction than filtering by ~1.3 dB. One possible cause
for this is the introduction of negative magnitude values in the
filtered STFTM. These effects are likely minimized in
demodulation through the LSE fitting procedure. Nonetheless,
both methods provide good reconstruction of the waveform
with overall SNR > 11 dB. In Exp2, consistent with the
recovered  STFTMs  (Figure  4),  demodulation  affords  a  larger 
gain  in  SNR  than  SBSS  in  the  example  shown  (captions,  Figure 
4b,  c)  and  on  average.  This  is  presumably  due  to  the  GCT's 
explicit  representation  of  pitch  dynamics.  In  informal  listening 
for Exp1, subjects (non-authors) reported no perceptual
difference  between  the  filtering  and  demodulation  methods  in 
relation  to  the  original  signal.  In  Exp2,  subjects  reported 
intelligible  reconstructions  of  the  target  speech  for  both 
methods  with  a  reduced  amplitude  of  the  interferer.  However, 
in  assessing  SBSS,  subjects  reported  that  the  interferer  sounded 
"metallic"  while  this  synthetic  quality  was  not  perceived  for  the 
GCT  system.  Though  more  formal  listening  tests  are  needed, 
these  observations  demonstrate  the  utility  of  the  AM  model  for 
representing  speech  content  of  a  single  speaker.  Furthermore, 
they  demonstrate  the  GCT's  feasibility  for  speaker  separation 
and  its  advantages  in  representing  pitch  dynamics  for  this  task. 
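
The paper does not give the exact SNR formula used for Table I; a common waveform-domain definition, assumed here, is the ratio of the energy of the original signal to the energy of the reconstruction error. The function name `snr_db` and the test signals are illustrative only.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between an original waveform and its reconstruction."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# One second at 8 kHz: a 200-Hz tone plus an error with 1/100 of its power.
t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 200 * t)
noisy = clean + 0.1 * np.sin(2 * np.pi * 1234 * t)
value = snr_db(clean, noisy)
```

An error with one hundredth of the signal power corresponds to 20 dB under this definition, which gives a sense of scale for the values in Table I.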

5.  CONCLUSIONS 

This  paper  has  introduced  a  2-D  processing  approach  for 
single-  and  multi-speaker  analysis.  We  have  quantitatively 
shown  that  a  2-D  modulation  model  accounting  for  near-DC 
terms  of  the  GCT  provides  good  representation  of  speech 
content  of  a  single  speaker.  We  have  also  shown  that  this 
model  is  a  promising  representation  for  co-channel  speaker 
separation.  For  the  separation  task,  one  limitation  of  the 
current  implementation  is  its  use  of  the  STFT  phase  computed 
for  the  mixed  signal  in  reconstruction.  Table  I  shows  results  of 
applying  the  STFTM  obtained  through  demodulation  with  the 
true phase of the target resulting in an average SNR of ~6 dB.
Future  work  will  explore  magnitude-only  reconstruction  [11] 
methods  to  address  this  discrepancy.  We  also  aim  to 
incorporate  existing  methods  for  multi-pitch  analysis  and 
estimation  (e.g.,  [10,  12])  with  the  current  framework  towards  a 
full  separation  system.  Finally,  the  current  framework  may  be 
extended for analysis/synthesis and separation of speech-like
sources (e.g., musical instruments) due to its representation of
harmonic (e.g., an instrument's pitch) and slowly-varying
structure (e.g., an instrument's timbre, analogous to speech
formants) in localized regions of the STFTM.